By Derek Lilienthal
This data was scraped from Dice.com between July 2021 and August 2021. In total, I captured more than 800 ‘Data Science’ listings, which I scraped myself using Python. I used many explicit programming rules while defining my word banks to capture things like skills, programming languages, technologies, methodologies, educational level, and years of experience mentioned in each ad. Every posting in the dataset is technology-related because Dice.com focuses on jobs in technology fields. Because this dataset was captured over a 30-day time frame from a single, relatively small job board, it is only a small sample of what job ads commonly mention, and it reflects postings on just that one website.
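As a rough illustration of the word-bank idea described above, here is a minimal sketch of rule-based matching. It is not the scraper's actual code: the `WORD_BANK` contents and the `find_terms` and `extract_features` helpers are hypothetical names chosen for this example, and the real word banks are assumed to be much larger.

```python
import re

# Hypothetical word banks; the lists used for the real dataset are assumed to be longer.
WORD_BANK = {
    "programming_languages": ["Python", "R", "SQL", "Java"],
    "technologies": ["Spark", "Tableau", "AWS"],
}


def find_terms(ad_text, terms):
    """Return the terms that appear in ad_text as whole words (case-insensitive)."""
    found = []
    for term in terms:
        # Lookarounds stop "Java" from matching inside "JavaScript"
        # and "R" from matching inside "Research".
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, ad_text, flags=re.IGNORECASE):
            found.append(term)
    return found


def extract_features(ad_text, word_bank=WORD_BANK):
    """Map each word-bank category to the terms mentioned in one job ad."""
    return {category: find_terms(ad_text, terms) for category, terms in word_bank.items()}


if __name__ == "__main__":
    ad = "We need Python and SQL skills. JavaScript and AWS are a plus. Research background preferred."
    print(extract_features(ad))
```

Whole-word matching is one of the explicit rules that matters most here, because short terms such as "R" would otherwise match inside unrelated words.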
Captured features
Snapshot of the dataset
| job_listing | skills | years_experience | company_name |
|---|---|---|---|
| Senior Data Scientist | Artificial Intelligence, Python, IT, SAS, SQL, PowerPoint, Foundation | 3 to 5 | New York Life Insurance Company |
| Data Scientist | Research, Computer, Programming, Python, Java, SQL, JavaScript, HTTP, SSL, Access | 2 | comScore |
| Data Scientist | Data, collect, clean, analyze | 3 to 5 | University Of Delaware |
| Principal Data Scientist - Search | Algorithms, Engineers, Python, Java, Data Mining, Computer, Research | 3 to 7 | Walmart |
| Data Scientist - Entry Level | Laboratory, Security, Applications, Java, Python, Matlab, Linux, UNIX, Windows | NA | Lawrence Livermore National Laboratory |
Data Cleaning
To get the data into a form that could be tabulated, a lot of pre-processing was needed. Mainly, scripts had to be written to tabulate things like skills, methodologies, and ranges for years of experience. Because I am more fluent in Python than R, I did most of the heavy pre-processing in Python using libraries I know well, such as pandas and NumPy. I did, however, do some pre-processing in R as well.
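The kind of tabulation described above can be sketched with pandas as follows. This is an illustrative example, not the notebook's actual code: the column names follow the dataset snapshot, while the sample rows and the `count_skills` and `parse_years` helpers are assumptions made for this sketch.

```python
import re

import pandas as pd

# Small made-up sample shaped like the dataset snapshot (not real rows).
df = pd.DataFrame(
    {
        "skills": ["Python, SQL, SAS", "Python, Java, SQL", "Python, Java"],
        "years_experience": ["3 to 5", "2", "NA"],
    }
)


def count_skills(frame, column="skills"):
    """Count how many listings mention each skill in a comma-separated column."""
    skills = frame[column].dropna().str.split(",").explode().str.strip()
    return skills[skills != ""].value_counts()


def parse_years(value):
    """Turn '3 to 5' into (3, 5), '2' into (2, 2), and missing values into (None, None)."""
    if pd.isna(value):
        return (None, None)
    numbers = [int(n) for n in re.findall(r"\d+", str(value))]
    if not numbers:
        return (None, None)
    return (min(numbers), max(numbers))


if __name__ == "__main__":
    print(count_skills(df))
    print(df["years_experience"].map(parse_years))
```

Splitting and exploding the skills column gives one row per skill mention, which makes frequency counts a single `value_counts()` call. Parsing the experience text into a (minimum, maximum) pair keeps single values and ranges in one consistent form.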
Link to Python code used to pre-process the dataset: https://github.com/dblilienthal/Who-What-When-and-Where-are-the-Data-Scientist-Jobs/blob/main/Pre-processing%20the%20data.ipynb
Link to dashboard source code: https://github.com/dblilienthal/Who-What-When-and-Where-are-the-Data-Scientist-Jobs
Link to scraper: https://github.com/dblilienthal/Web-Scraping
A few companies have been actively recruiting.
There were 357 different companies that posted a job ad for a data scientist role.
Aside from knowledge of statistics and general programming, there are many skills not often talked about that are vital for a Data Scientist to know.
Even though many of the jobs are located in major metropolitan areas, a significant portion are offered remotely.